Search for: All records

Creators/Authors contains: "Apon, Amy W."

Clustered Latent Dirichlet Allocation for Scientific Discovery

https://doi.org/10.1109/BigData47090.2019.9005964

Gropp, Christopher; Herzog, Alexander; Safro, Ilya; Wilson, Paul W.; Apon, Amy W. (December 2019, 2019 IEEE International Conference on Big Data (Big Data))
Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems that make sense of highly dynamic and diverse streams of text data related, but not limited, to scientific discovery. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time step to another. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. Our approach is based on data decomposition in which the data is partitioned into segments, followed by topic modeling on the individual segments. The resulting local models are then combined into a global solution using clustering. The decomposition and resulting parallelization lead to very fast runtime even on very large datasets. Our approach furthermore provides insight into how the composition of topics changes over time and can also be applied using other data partitioning strategies over any discrete features of the data, such as geographic features or classes of users. In this paper, CLDA is applied successfully to seventeen years of NIPS conference papers (2,484 documents and 3,280,697 words), seventeen years of computer science journal abstracts (533,588 documents and 46,446,184 words), and forty years of the PubMed corpus (4,025,976 documents and 386,847,695 words). On the PubMed corpus, we demonstrate the versatility of CLDA by segmenting the data by both time and journal. Our runtime on this corpus demonstrates an ability to function on very large-scale datasets.
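The pipeline the abstract describes (partition the data, fit a topic model per segment, cluster the local topics into global ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tiny collapsed Gibbs sampler standing in for LDA, the choice of k-means as the clustering step, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def partition(docs, key):
    """Group token lists into segments by a discrete feature (e.g. year)."""
    segments = defaultdict(list)
    for doc in docs:
        segments[key(doc)].append(doc["tokens"])
    return dict(segments)

def local_lda(token_docs, vocab, k, rng, iters=50, alpha=0.1, beta=0.01):
    """Toy collapsed Gibbs LDA; returns k topic-word distributions over vocab."""
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    docs = [[index[w] for w in d] for d in token_docs]
    z = [[rng.randrange(k) for _ in d] for d in docs]  # topic of each token
    ndk = [[0] * k for _ in docs]       # doc-topic counts
    nkw = [[0] * v for _ in range(k)]   # topic-word counts
    nk = [0] * k                        # tokens per topic

    def bump(d, t, w, delta):
        ndk[d][t] += delta
        nkw[t][w] += delta
        nk[t] += delta

    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            bump(d, z[d][i], w, 1)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, z[d][i], w, -1)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + v * beta) for t in range(k)]
                z[d][i] = rng.choices(range(k), weights)[0]
                bump(d, z[d][i], w, 1)
    return [[(nkw[t][w] + beta) / (nk[t] + v * beta) for w in range(v)]
            for t in range(k)]

def kmeans(points, k, rng, iters=20):
    """Plain k-means with squared Euclidean distance (assumed clustering step)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def clda_sketch(docs, key, local_k, global_k, seed=0):
    """Partition, fit local models, then cluster local topics into global ones."""
    rng = random.Random(seed)
    segments = partition(docs, key)
    vocab = sorted({w for toks in segments.values() for d in toks for w in d})
    local_topics, origin = [], []
    # Each segment is fitted independently, so this loop could run in parallel.
    for seg in sorted(segments):
        for topic in local_lda(segments[seg], vocab, local_k, rng):
            local_topics.append(topic)
            origin.append(seg)
    labels, centers = kmeans(local_topics, global_k, rng)
    return vocab, origin, labels, centers

# Hypothetical toy corpus segmented by year.
toy_docs = [
    {"year": 2001, "tokens": "neural network learning neural".split()},
    {"year": 2001, "tokens": "protein gene cell protein".split()},
    {"year": 2002, "tokens": "network learning kernel learning".split()},
    {"year": 2002, "tokens": "gene cell disease gene".split()},
]
vocab, origin, labels, centers = clda_sketch(
    toy_docs, key=lambda d: d["year"], local_k=2, global_k=2)
for seg, lab in zip(origin, labels):
    print(f"segment {seg}: local topic assigned to global topic {lab}")
```

Tracking which segment each local topic came from (`origin`) alongside its global cluster label is what would let one observe how a global topic's composition shifts across segments; passing a different `key`, such as a journal name, swaps the partitioning strategy without changing the rest of the sketch.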
Full Text Available
Building A Scalable Forward Flux Sampling Framework using Big Data and HPC

https://doi.org/10.1145/3332186.3332205

DeFever, Ryan S.; Hanger, Walter; Sarupria, Sapna; Kilgannon, Jon; Apon, Amy W.; Ngo, Linh B. (January 2019, Practice and Experience in Advanced Research Computing PEARC19)

Full Text Available